Learning Approaches to Wrapper Induction
Authors
Abstract
The number, size, and dynamics of Internet information sources bear abundant evidence of the need for automation in information extraction (IE). This paper deals with the question of how such extraction mechanisms can be created automatically by invoking learning techniques. The underlying scenario of system-supported IE places certain constraints on the available training examples. Therefore, the traditional approaches to formal language learning do not capture the kind of problems to be solved when learning the corresponding extraction mechanisms. We illustrate the resulting differences by studying the problem of learning a particular type of extraction mechanism (so-called island wrappers). We show how to decompose this learning problem into different subproblems that can be handled independently and in parallel. Moreover, we relate the learning problems at hand to the problems that learning theory papers originally address and point out what they have in common and where they differ.
Similar resources
Active Learning with Strong and Weak Views: A Case Study on Wrapper Induction
Multi-view learners reduce the need for labeled data by exploiting disjoint sub-sets of features (views), each of which is sufficient for learning. Such algorithms assume that each view is a strong view (i.e., perfect learning is possible in each view). We extend the multi-view framework by introducing a novel algorithm, Aggressive Co-Testing, that exploits both strong and weak views; in a weak...
View Validation: A Case Study for Wrapper Induction and Text Classification
Wrapper induction algorithms, which use labeled examples to learn extraction rules, are a crucial component of information agents that integrate semi-structured information sources. Multi-view wrapper induction algorithms reduce the amount of training data by exploiting several types of rules (i.e., views), each of which is sufficient to extract the relevant data. All multi-view algorithms re...
Control of Inductive Bias in Supervised Learning using Evolutionary Computation: A Wrapper-Based Approach
In this chapter, I discuss the problem of feature subset selection for supervised inductive learning approaches to knowledge discovery in databases (KDD), and examine this and related problems in the context of controlling inductive bias. I survey several combinatorial search and optimization approaches to this problem, focusing on data-driven validation-based techniques. In particular, I presen...
Populating Ontologies with Data from OCRed Lists
A flexible, accurate, and efficient method of automatically extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine searchable, queryable, and linkable and expose their rich ontological interrelationships. To work well, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selectio...
Wrapper Induction for Information Extraction
Wrapper Induction for Information Extraction, by Nicholas Kushmerick. Chairperson of Supervisory Committee: Professor Daniel S. Weld, Department of Computer Science and Engineering. The Internet presents numerous sources of useful information: telephone directories, product catalogs, stock quotes, weather forecasts, etc. Recently, many systems have been built that automatically gather and manipulate...